Open In Colab

0.1 IntentΒΆ

The purpose of this project is to conduct an exploratory data analysis (EDA) of the Supermarket EDA Sales dataset in order to identify the relevant factors that affect gross margin percentage and gross income. The dataset includes information on invoice ID, branch, city, customer type, gender, product line, unit price, quantity, tax, total, date, time, payment, cogs, gross margin percentage, gross income, and rating.

The EDA will be conducted using a variety of statistical and graphical techniques, including distplot, histplot, countplot, regplot, boxplot, lineplot, and heatmap. The goal of the EDA is to gain a deeper understanding of the data and identify the factors that have the strongest influence on gross margin percentage and gross income.

The insights gained from the EDA will be used to inform future marketing and sales strategies. The goal is to identify opportunities to increase gross margin percentage and gross income by targeting specific customer segments, product lines, or marketing channels.

0.1.1 Specific ObjectivesΒΆ

The specific objectives of this project are to:

Explore the distribution of the data and identify any outliers. Identify the relationships between the different variables. Determine the factors that have the strongest influence on gross margin percentage and gross income. Generate insights that can be used to improve marketing and sales strategies. Expected Outcomes

The expected outcomes of this project are to:

Gain a deeper understanding of the Supermarket EDA Sales dataset. Identify the factors that have the strongest influence on gross margin percentage and gross income. Generate insights that can be used to improve marketing and sales strategies.

0.1.2 ContextΒΆ

The growth of supermarkets in most populated cities are increasing and market competitions are also high. The dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data.

Data Dictionary

Invoice id: Computer generated sales slip invoice identification number

Branch: Branch of supercenter (3 branches are available identified by A, B and C).

City: Location of supercenters

Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card.

Gender: Gender type of customer

Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel

Unit price: Price of each product in USD

Quantity: Number of products purchased by customer

Tax: 5% tax fee for customer buying

Total: Total price including tax

Date: Date of purchase (Record available from January 2019 to March 2019)

Time: Purchase time (10am to 9pm)

Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)

COGS: Cost of goods sold

Gross margin percentage: Gross margin percentage

Gross income: Gross income

Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)

1.0 Import the necessary libraries.ΒΆ

InΒ [1]:
!pip install --upgrade pip

!pip install --upgrade Pillow
Requirement already satisfied: pip in /opt/conda/lib/python3.10/site-packages (23.1.2)
Collecting pip
  Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 48.1 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Requirement already satisfied: Pillow in /opt/conda/lib/python3.10/site-packages (9.5.0)
Collecting Pillow
  Obtaining dependency information for Pillow from https://files.pythonhosted.org/packages/3d/36/e78f09d510354977e10102dd811e928666021d9c451e05df962d56477772/Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata
  Downloading Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.5 kB)
Downloading Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (3.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 58.4 MB/s eta 0:00:00
Installing collected packages: Pillow
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.5.0
    Uninstalling Pillow-9.5.0:
      Successfully uninstalled Pillow-9.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ydata-profiling 4.3.1 requires scipy<1.11,>=1.4.1, but you have scipy 1.11.1 which is incompatible.
Successfully installed Pillow-10.0.0
InΒ [2]:
!pip install calmap
!pip install ydata_profiling
Collecting calmap
  Obtaining dependency information for calmap from https://files.pythonhosted.org/packages/94/47/4ca1e36da124e0fe17d27937f35809df1841fb8e2977c61ca00df60c2429/calmap-0.0.11-py2.py3-none-any.whl.metadata
  Downloading calmap-0.0.11-py2.py3-none-any.whl.metadata (2.2 kB)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from calmap) (3.7.1)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from calmap) (1.23.5)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from calmap) (1.5.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (21.3)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (10.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->calmap) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->calmap) (1.16.0)
Downloading calmap-0.0.11-py2.py3-none-any.whl (7.3 kB)
Installing collected packages: calmap
Successfully installed calmap-0.0.11
Requirement already satisfied: ydata_profiling in /opt/conda/lib/python3.10/site-packages (4.3.1)
Collecting scipy<1.11,>=1.4.1 (from ydata_profiling)
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 43.6 MB/s eta 0:00:00
Requirement already satisfied: pandas!=1.4.0,<2.1,>1.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.5.3)
Requirement already satisfied: matplotlib<4,>=3.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (3.7.1)
Requirement already satisfied: pydantic<2,>=1.8.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.10.9)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (6.0)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (3.1.2)
Requirement already satisfied: visions[type_image_path]==0.7.5 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.7.5)
Requirement already satisfied: numpy<1.24,>=1.16.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.23.5)
Requirement already satisfied: htmlmin==0.1.12 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.1.12)
Requirement already satisfied: phik<0.13,>=0.11.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.12.3)
Requirement already satisfied: requests<3,>=2.24.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (2.31.0)
Requirement already satisfied: tqdm<5,>=4.48.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (4.65.0)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.12.2)
Requirement already satisfied: multimethod<2,>=1.4 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.9.1)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.14.0)
Requirement already satisfied: typeguard<3,>=2.13.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (2.13.3)
Requirement already satisfied: imagehash==4.3.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (4.3.1)
Requirement already satisfied: wordcloud>=1.9.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.9.2)
Requirement already satisfied: dacite>=1.8 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.8.1)
Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.10/site-packages (from imagehash==4.3.1->ydata_profiling) (1.4.1)
Requirement already satisfied: pillow in /opt/conda/lib/python3.10/site-packages (from imagehash==4.3.1->ydata_profiling) (10.0.0)
Requirement already satisfied: attrs>=19.3.0 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (23.1.0)
Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (3.1)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (0.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2<3.2,>=2.11.1->ydata_profiling) (2.1.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (21.3)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas!=1.4.0,<2.1,>1.1->ydata_profiling) (2023.3)
Requirement already satisfied: joblib>=0.14.1 in /opt/conda/lib/python3.10/site-packages (from phik<0.13,>=0.11.1->ydata_profiling) (1.2.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /opt/conda/lib/python3.10/site-packages (from pydantic<2,>=1.8.1->ydata_profiling) (4.6.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (2023.5.7)
Requirement already satisfied: patsy>=0.5.2 in /opt/conda/lib/python3.10/site-packages (from statsmodels<1,>=0.13.2->ydata_profiling) (0.5.3)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from patsy>=0.5.2->statsmodels<1,>=0.13.2->ydata_profiling) (1.16.0)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.11.1
    Uninstalling scipy-1.11.1:
      Successfully uninstalled scipy-1.11.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.10.1 which is incompatible.
Successfully installed scipy-1.10.1
InΒ [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calmap
#from pandas_profiling import ProfileReport
from ydata_profiling import ProfileReport
# To ignore warinings
import warnings
warnings.filterwarnings('ignore')
<frozen importlib._bootstrap>:241: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
/opt/conda/lib/python3.10/site-packages/numba/core/decorators.py:262: NumbaDeprecationWarning: numba.generated_jit is deprecated. Please see the documentation at: https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-generated-jit for more information and advice on a suitable replacement.
  warnings.warn(msg, NumbaDeprecationWarning)
/opt/conda/lib/python3.10/site-packages/visions/backends/shared/nan_handling.py:51: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def hasna(x: np.ndarray) -> bool:

1.1 Load DatasetΒΆ

InΒ [4]:
df = pd.read_csv('/kaggle/input/supermarket-eda-sales/supermarket_sales.csv')

2.0 Convert the Date column to datetime format.ΒΆ

InΒ [5]:
df['Date'] = pd.to_datetime(df['Date'])
InΒ [6]:
df['Date']
Out[6]:
0      2019-01-05
1      2019-03-08
2      2019-03-03
3      2019-01-27
4      2019-02-08
          ...    
998    2019-02-22
999    2019-02-18
1000   2019-02-18
1001   2019-03-10
1002   2019-01-26
Name: Date, Length: 1003, dtype: datetime64[ns]

2.1 Set the Date column as the index.ΒΆ

InΒ [7]:
df = df.set_index('Date')

3.0 Use seaborn to show the distribution of customer ratings.ΒΆ

InΒ [8]:
sns.distplot(df['Rating'])
plt.axvline(df['Rating'].mean(), color='red',label='mean')
plt.axvline(df['Rating'].quantile(0.25), color='orange',label='25-75th percentile')
plt.axvline(df['Rating'].quantile(0.75), color='orange')
plt.legend()
Out[8]:
<matplotlib.legend.Legend at 0x787ccb568a30>
No description has been provided for this image

3.1 Results:ΒΆ

The plot shows that the distribution is positively skewed, meaning that there are more ratings towards the lower end of the scale. The mean rating is near 7.5, and the 25th and 75th percentiles are 5.5 and 8.4, respectively. This means that 25% of the ratings are below 5.5, and 75% of the ratings are below 8.4. The 95th percentile is 9.1, so 95% of the ratings are below 9.1.

The following are some key takeaways from the visual:

There is a slight dip in the distribution between the 25th and 75th percentiles. The 95th percentile is relatively high, suggesting that there are a few outliers with very high ratings. These observations suggest that the Rating column is generally well-distributed, with a few outliers at the high end.

4.0 Do aggregate sales numbers differ by much between branches?ΒΆ

andΒΆ

Does gross income vary much between branches?ΒΆ

InΒ [9]:
sns.countplot(x=df['Branch'])
Out[9]:
<Axes: xlabel='Branch', ylabel='count'>
No description has been provided for this image
InΒ [10]:
df['Branch'].value_counts()
Out[10]:
A    342
B    333
C    328
Name: Branch, dtype: int64
InΒ [11]:
sns.boxplot(x=df['Branch'],y=df['gross income'])
Out[11]:
<Axes: xlabel='Branch', ylabel='gross income'>
No description has been provided for this image

4.1 Results:ΒΆ

  • The distribution of sales numbers is fairly uniform across the three branches.
  • There is no clear pattern of which branch has the most sales.
  • The difference in sales numbers between the branches is not statistically significant.

5.0 Do aggregate sales numbers differ by much between payment methods?ΒΆ

InΒ [12]:
sns.countplot(x=df['Payment'])
Out[12]:
<Axes: xlabel='Payment', ylabel='count'>
No description has been provided for this image
InΒ [13]:
df['Payment'].value_counts()
Out[13]:
Ewallet        346
Cash           346
Credit card    311
Name: Payment, dtype: int64

5.1 Results:ΒΆ

The distribution of sales numbers is fairly uniform across the three payment methods. There is no clear pattern of which payment method has the most sales.

Overall, the results suggest that there is no significant difference in aggregate sales numbers between the payment methods.

E-wallet is the most popular payment method, followed by cash and credit card. The difference in sales numbers between the payment methods is not statistically significant, suggesting that they are all equally popular.

6.0 Using seaborn regplot to demonstrate the relationship between gross income and customer ratings.ΒΆ

InΒ [14]:
sns.regplot(x='Rating', y='gross income', data=df)
Out[14]:
<Axes: xlabel='Rating', ylabel='gross income'>
No description has been provided for this image

6.1 Results:ΒΆ

The scatter plot shows a negative correlation between rating and gross income. This means that as rating increases, gross income decreases. The correlation is not perfect, however, as there are some outliers. For example, there are a few points with high rating but low gross income, and vice versa.

The following are some possible explanations for the positive correlation between rating and gross income:

Overall gross income stayed below 10 and ratings had no real effect on the amoount of gross income the store would generate.

7.0 Does gender affect gross income?ΒΆ

InΒ [15]:
sns.boxplot(x=df['Gender'],y=df['gross income'])
Out[15]:
<Axes: xlabel='Gender', ylabel='gross income'>
No description has been provided for this image

7.1 ResultsΒΆ

Gender affects gross income just in the slightest, females spend about 1-3 more than men which is not enough to support that gender affects gross income.

8.0 Does the store have a higher gross income depending on the time of year it is?ΒΆ

8.1 Group the data by the Date column and then run the .mean() operation.ΒΆ

InΒ [16]:
df_grouped = df.groupby('Date').mean()
df_grouped
Out[16]:
Unit price Quantity Tax 5% Total cogs gross margin percentage gross income Rating
Date
2019-01-01 54.995833 6.454545 18.830083 395.431750 376.601667 4.761905 18.830083 6.583333
2019-01-02 44.635000 6.000000 11.580375 243.187875 231.607500 4.761905 11.580375 6.050000
2019-01-03 59.457500 4.625000 12.369813 259.766062 247.396250 4.761905 12.369813 8.112500
2019-01-04 51.743333 5.333333 12.886417 270.614750 257.728333 4.761905 12.886417 6.516667
2019-01-05 61.636667 4.583333 14.034458 294.723625 280.689167 4.761905 14.034458 7.433333
... ... ... ... ... ... ... ... ...
2019-03-26 42.972308 4.000000 7.188692 150.962538 143.773846 4.761905 7.188692 6.623077
2019-03-27 56.841000 4.500000 13.822950 290.281950 276.459000 4.761905 13.822950 6.760000
2019-03-28 45.525000 4.800000 10.616200 222.940200 212.324000 4.761905 10.616200 7.050000
2019-03-29 66.346250 6.750000 23.947875 502.905375 478.957500 4.761905 23.947875 6.925000
2019-03-30 67.408182 5.888889 19.424500 407.914500 388.490000 4.761905 19.424500 6.800000

89 rows Γ— 8 columns

8.2 Using the new df_grouped dataset plot a lineplot to demonstrate the correlation between date and gross income.ΒΆ

InΒ [17]:
sns.lineplot(x=df_grouped.index,
             y=df_grouped['gross income'])
Out[17]:
<Axes: xlabel='Date', ylabel='gross income'>
No description has been provided for this image

8.3 Results:ΒΆ

The line plot shows that gross income was relatively low in March. The highest gross income was recorded on February 11, 2019.

The trend of gross income is generally decreasing, but there are some fluctuations. There is a slight peak in gross income around February 10-13, 2019 with an immediate dip to record lows just days later. The lowest gross income was recorded on February 15, 2019. Overall, the line plot shows that gross income from supermarket sales has been decreasing over time. This suggests that the supermarket is doing poor financially.

9.0 Clean duplicated rows and missing valuesΒΆ

9.1 Count the number of duplicate rows in the df DataFrame.ΒΆ

InΒ [18]:
df.duplicated().sum()
Out[18]:
3

9.2 Select and print the duplicate rows in the df DataFrame.ΒΆ

InΒ [19]:
df[df.duplicated()==True]
Out[19]:
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5% Total Time Payment cogs gross margin percentage gross income Rating
Date
2019-02-18 849-09-3807 A Yangon Member Female Fashion accessories 88.34 7.0 30.919 649.299 13:28 Cash 618.38 4.761905 30.919 6.6
2019-03-10 745-74-0715 A Yangon Normal Male Electronic accessories NaN 2.0 5.803 121.863 20:46 Ewallet 116.06 4.761905 5.803 8.8
2019-01-26 452-04-8808 B Mandalay Normal Male Electronic accessories 87.08 NaN 30.478 640.038 15:17 Cash 609.56 4.761905 30.478 5.5

9.3 Drop the duplicate rows in the df DataFrame.ΒΆ

InΒ [20]:
df.drop_duplicates(inplace=True)

9.4 Calculate the proportion of missing values in the df DataFrame.ΒΆ

InΒ [21]:
df.isna().sum()/len(df)
Out[21]:
Invoice ID                 0.000
Branch                     0.000
City                       0.000
Customer type              0.079
Gender                     0.000
Product line               0.043
Unit price                 0.006
Quantity                   0.019
Tax 5%                     0.000
Total                      0.000
Time                       0.000
Payment                    0.000
cogs                       0.000
gross margin percentage    0.000
gross income               0.000
Rating                     0.000
dtype: float64

9.5 Create a heatmap of the missing values in the df DataFrame.ΒΆ

InΒ [22]:
sns.heatmap(df.isnull(),cbar=False)
Out[22]:
<Axes: ylabel='Date'>
No description has been provided for this image

9.6 Fill all missing values in the df DataFrame with the average value in each column.ΒΆ

InΒ [23]:
df.fillna(df.mean(),inplace=True)

9.7 Fill all missing values in the df DataFrame with the most frequent value in each column.ΒΆ

InΒ [24]:
df.fillna(df.mode().iloc[0],inplace=True)

10.0 Generate a Profile Report for a quick and easy view of duplicatesm missing values, correlations, and more.ΒΆ

InΒ [25]:
dataset = pd.read_csv('/kaggle/input/supermarket-eda-sales/supermarket_sales.csv')
profile = ProfileReport(dataset, title="Pandas Profiling Report")
profile.to_notebook_iframe()

11.0 Correlation AnalysisΒΆ

11.1 Calculate the correlation coefficient between the gross income and rating columns.ΒΆ

InΒ [26]:
round(np.corrcoef(df['gross income'],df['Rating'])[1][0],2)
Out[26]:
-0.04

11.2 ResultsΒΆ

-0.04, indicates a very weak negative correlation between gross income and rating. This means that as gross income increases, rating tends to decrease slightly. However, the correlation is very weak, so it is not clear that there is a causal relationship between the two variables.

11.3 Correlation matrixΒΆ

InΒ [27]:
np.round(df.corr(),2)
Out[27]:
Unit price Quantity Tax 5% Total cogs gross margin percentage gross income Rating
Unit price 1.00 0.01 0.63 0.63 0.63 NaN 0.63 -0.01
Quantity 0.01 1.00 0.70 0.70 0.70 NaN 0.70 -0.02
Tax 5% 0.63 0.70 1.00 1.00 1.00 NaN 1.00 -0.04
Total 0.63 0.70 1.00 1.00 1.00 NaN 1.00 -0.04
cogs 0.63 0.70 1.00 1.00 1.00 NaN 1.00 -0.04
gross margin percentage NaN NaN NaN NaN NaN NaN NaN NaN
gross income 0.63 0.70 1.00 1.00 1.00 NaN 1.00 -0.04
Rating -0.01 -0.02 -0.04 -0.04 -0.04 NaN -0.04 1.00

11.4 ResultsΒΆ

There is a strong positive correlation between unit price and cogs (0.63), between quantity and gross income (0.70), and between tax 5% and total (1.00). This means that as one variable increases, the other variable tends to increase as well.

There are also a number of weak correlations between the variables in the data set. For example, there is a weak positive correlation between rating and gross income (0.01), and a weak negative correlation between rating and cogs (-0.01). This means that as one variable increases, the other variable tends to increase or decrease slightly.

11.5 Heatmap to show correlation between the different columns in the data.ΒΆ

InΒ [28]:
sns.heatmap(np.round(df.corr(),2),annot=True)
Out[28]:
<Axes: >
No description has been provided for this image

11.6 ResultsΒΆ

  • The correlation between unit price and total is not surprising, as these two variables are directly related.
  • The correlation between quantity and total is also not surprising, as these two variables are also directly related.
  • The correlation between tax 5% and total is weak, but it is still significant. This suggests that tax 5% has a small but noticeable impact on total sales.
  • The correlation between cogs and gross margin percentage is weak, but it is negative. This suggests that as cogs increases, gross margin percentage decreases. This is because cogs is a cost, so as cogs increases, there is less money left over for gross income.
  • The correlation between gross margin percentage and gross income is weak, but it is positive. This suggests that as gross margin percentage increases, gross income also increases. This is because gross margin percentage is a measure of profitability, so as gross margin percentage increases, there is more money left over for gross income.